In this hands-on session, we will show how principal component analysis (PCA) is used in common bioinformatics workflows. We will use a toy single-cell RNA-seq (scRNA-seq) dataset to illustrate it.
The main objectives are the following:
prcomp function.Single-cell RNA-seq allows the gene expression profiling of thousands of cells, one a time. It is often exemplified using the following analogy:
The scRNA-seq pipeline can be summarized in these 4 steps:
Image extracted from this paper.
scRNA-seq is a double-edged sword. On the one hand, scRNA-seq provides unprecedented discriminative power to chart all the cell types and states in a given tissue. This has catalyzed the creation of the Human Cell Atlas (HCA), which aims to classify all cells in the human body using scRNA-seq; and has been coined as āthe periodic table of human cellsā.
On the other hand, however, scRNA-seq has a high degree of technical variability, which can be introduced at multiple points of the aforementioned pipeline. Some examples include the following:
In this scenario, PCA is used for 3 main purposes:
As an example, we will create an expression matrix with the following cells:
library(pheatmap)
library(tidyverse)
## āā Attaching packages āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā tidyverse 1.3.0 āā
## ā ggplot2 3.3.2 ā purrr 0.3.4
## ā tibble 3.0.3 ā dplyr 1.0.2
## ā tidyr 1.1.2 ā stringr 1.4.0
## ā readr 1.3.1 ā forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## āā Conflicts āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā tidyverse_conflicts() āā
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Create toy dataset
toy <- data.frame(
CD3D = c(4, 5, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
CD3E = c(5, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
LYZ = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
S100A8 = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
CD79A = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0),
CD79B = c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0),
BLNK = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 5, 5, 4, 0, 0),
TOP2A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 9, 0, 0),
MKI67 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 10, 0, 0),
MT_CO1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 23),
MT_ND5 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 25)
)
rownames(toy) <- paste("cell", as.character(1:nrow(toy)))
toy <- as.matrix(toy)
# Visualize
pheatmap(toy, cluster_cols = FALSE, cluster_rows = FALSE, angle_col = 45)
In the following image you can visualize the redundancy between CD79A and CD79B. Together, they form the heterodimer CD79; which means that for every molecule of CD79A we need one of CD79B. Under the hood they are the same variable, which should be captured by a single principal component. A good analogy would be to measure the height of a person in feet and in cm.
Images obtained from this link
Three ways to perform PCA in R:
prcomp function.Regardless of the method, we always need to center the data:
toy_centered <- scale(toy, center = TRUE, scale = FALSE)
Now we can compute the 3 ways:
# Eigendecomposition of the covariance matrix
cov_matrix <- (1 / (nrow(toy_centered) - 1)) * (t(toy_centered) %*% toy_centered)
eigen_cov_matrix <- eigen(cov_matrix)
head(eigen_cov_matrix$values)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
eigen_cov_matrix$vectors[1:6, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.03621728 -0.1281224 -0.2892754 0.37018269 -0.1942142 -0.4178419
## [2,] -0.03621728 -0.1281224 -0.2892754 0.37018269 -0.1942142 -0.4178419
## [3,] -0.04319892 -0.1949563 0.5855971 -0.07543384 -0.1254525 -0.2728953
## [4,] -0.04319892 -0.1949563 0.5855971 -0.07543384 -0.1254525 -0.2728953
## [5,] -0.06565032 0.2467930 -0.1395316 -0.42583898 -0.3020970 -0.3628173
## [6,] -0.06999161 0.2975760 -0.1023968 -0.35142006 -0.6180140 0.1555483
pc_mat_by_eigen <- toy %*% eigen_cov_matrix$vectors
pc_mat_by_eigen[1:6, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## cell 1 -0.3259555 -1.1531013 -2.603479 3.3316442 -1.747928 -3.760577
## cell 2 -0.2897383 -1.0249790 -2.314203 2.9614615 -1.553714 -3.342735
## cell 3 -0.2897383 -1.0249790 -2.314203 2.9614615 -1.553714 -3.342735
## cell 4 -0.2535210 -0.8968566 -2.024928 2.5912788 -1.359499 -2.924893
## cell 5 -0.4319892 -1.9495631 5.855971 -0.7543384 -1.254525 -2.728953
## cell 6 -0.6047848 -2.7293883 8.198360 -1.0560738 -1.756335 -3.820534
# SVD
toy_to_svd <- (1 / sqrt(nrow(toy) - 1)) * toy_centered
toy_svd <- svd(toy_to_svd)
head(toy_svd$d ** 2)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
# PCA
pca_out <- prcomp(toy, center = TRUE)
# The singular values, PC scores and gene loadings are in the sdev, x and
# rotation, respectively
head(pca_out$sdev ** 2)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
pca_out$x
## PC1 PC2 PC3 PC4 PC5 PC6
## cell 1 -3.951320 -3.9486085 3.661022353 3.654520583 0.25850086 -0.47150505
## cell 2 -3.915103 -3.8204861 3.371746930 3.284337897 0.06428666 -0.05366314
## cell 3 -3.915103 -3.8204861 3.371746930 3.284337897 0.06428666 -0.05366314
## cell 4 -3.878885 -3.6923637 3.082471507 2.914155211 -0.12992755 0.36417876
## cell 5 -4.057353 -4.7450703 -4.798427533 -0.431462021 -0.23490185 0.56011951
## cell 6 -4.230149 -5.5248955 -7.140815964 -0.733197393 0.26690820 -0.53146153
## cell 7 -4.143751 -5.1349829 -5.969621749 -0.582329707 0.01600317 0.01432899
## cell 8 -4.800316 1.8496472 3.205914736 -6.480809141 0.07373512 -0.80981026
## cell 9 -4.525318 0.7972224 2.660452378 -4.798614066 0.58270447 0.81573100
## cell 10 -4.457184 0.4814898 2.588858340 -4.511032499 -1.20990530 0.05468109
## cell 11 -6.171637 13.6435992 -2.534752330 3.489637251 -1.30875857 -0.10273060
## cell 12 -6.073610 12.1698727 -1.293445808 0.955715383 1.57894297 0.11540916
## cell 13 24.327990 0.5451695 0.001508073 0.008452255 -0.18771233 0.65946431
## cell 14 29.791738 1.1998922 -0.206657864 -0.053711651 0.16583749 -0.56107911
## PC7 PC8 PC9 PC10 PC11
## cell 1 -7.071068e-01 0.123135176 -0.175028076 -0.055146264 -8.459997e-17
## cell 2 1.414214e+00 0.003214116 0.002251261 0.003483378 -3.066446e-16
## cell 3 8.703355e-16 0.003214116 0.002251261 0.003483378 2.642233e-17
## cell 4 -7.071068e-01 -0.116706945 0.179530598 0.062113019 4.705115e-16
## cell 5 -2.002961e-14 -0.168732189 0.254045449 0.085826011 -3.066446e-16
## cell 6 2.232540e-14 0.146236682 -0.212530754 -0.068851789 -7.507338e-16
## cell 7 1.147891e-15 -0.011247753 0.020757348 0.008487111 -7.507338e-16
## cell 8 3.109616e-14 0.307995407 -0.051643310 0.116372232 1.104971e-16
## cell 9 -3.118735e-14 -0.461037674 -0.470049370 -0.030457398 8.107271e-16
## cell 10 -1.794200e-15 0.018243159 0.411680421 -0.120109433 2.426507e-16
## cell 11 4.950405e-15 -0.120543044 -0.283243462 0.023737601 3.613459e-16
## cell 12 -4.403224e-15 0.149102392 0.345292787 -0.028677522 1.860972e-16
## cell 13 -2.777342e-14 0.769768895 -0.136118692 -0.001458603 5.634938e-17
## cell 14 2.263071e-14 -0.642642339 0.112804538 0.001198280 3.788446e-16
pca_out$rotation
## PC1 PC2 PC3 PC4 PC5
## CD3D -0.03621728 -0.12812237 0.28927542 0.370182686 0.194214205
## CD3E -0.03621728 -0.12812237 0.28927542 0.370182686 0.194214205
## LYZ -0.04319892 -0.19495631 -0.58559711 -0.075433843 0.125452513
## S100A8 -0.04319892 -0.19495631 -0.58559711 -0.075433843 0.125452513
## CD79A -0.06565032 0.24679303 0.13953157 -0.425838976 0.302097034
## CD79B -0.06999161 0.29757597 0.10239682 -0.351420065 0.618014029
## BLNK -0.07184864 0.27941939 0.13319961 -0.415258562 -0.556581707
## TOP2A -0.07437979 0.57153188 -0.22431747 0.353096905 -0.237385930
## MKI67 -0.07432314 0.56844945 -0.21769693 0.335371635 0.221260081
## MT_CO1 0.64897128 0.07824766 -0.02520615 -0.007577383 0.069159859
## MT_ND5 0.73963051 0.08782812 -0.02737840 -0.008092330 0.002583509
## PC6 PC7 PC8 PC9 PC10
## CD3D -0.417841906 7.071068e-01 0.11992106 -0.1772793 -0.05862964
## CD3E -0.417841906 -7.071068e-01 0.11992106 -0.1772793 -0.05862964
## LYZ -0.272895260 1.082467e-14 0.07874222 -0.1166441 -0.03866945
## S100A8 -0.272895260 1.035283e-14 0.07874222 -0.1166441 -0.03866945
## CD79A -0.362817278 1.368350e-14 0.32241549 0.4184122 0.48172172
## CD79B 0.155548283 -5.967449e-15 -0.20858989 -0.5454695 -0.15932743
## BLNK -0.449953352 1.745826e-14 0.06210105 -0.2092092 -0.40830690
## TOP2A -0.063532910 2.581269e-15 -0.07320320 -0.3655626 0.54170097
## MKI67 0.007874334 -4.163336e-16 0.11551433 0.4232463 -0.51959231
## MT_CO1 -0.338627354 1.454392e-14 -0.64733602 0.1811808 0.02643551
## MT_ND5 0.157531117 -7.438494e-15 0.60808962 -0.2189936 -0.04317356
## PC11
## CD3D -4.509519e-16
## CD3E -1.153023e-16
## LYZ 7.071068e-01
## S100A8 -7.071068e-01
## CD79A 2.766429e-16
## CD79B -2.934072e-17
## BLNK -6.267579e-16
## TOP2A 6.589544e-16
## MKI67 -6.802521e-16
## MT_CO1 2.499914e-16
## MT_ND5 -3.091539e-16
# PCA centered and scaled
pca_out_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)
Importantly, we obtain the following:
pca_out$rotationpca_out$xpca_out$sdev ** 2In addition, we can get the proportion of variance explained (PVE) and the cumulative proportion with the summary function (more info in this book):
summary(pca_out_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1414 1.6160 1.5448 1.1065 0.28172 0.25075 0.19522
## Proportion of Variance 0.4169 0.2374 0.2170 0.1113 0.00722 0.00572 0.00346
## Cumulative Proportion 0.4169 0.6543 0.8712 0.9826 0.98977 0.99548 0.99895
## PC8 PC9 PC10 PC11
## Standard deviation 0.09752 0.04135 0.01862 1.383e-16
## Proportion of Variance 0.00086 0.00016 0.00003 0.000e+00
## Cumulative Proportion 0.99981 0.99997 1.00000 1.000e+00
To reduce the dimensionality, we will choose the number of PC that explains most of the variance in the dataset. To this end, we use a so-called elbow plot:
# Calculate percentage of variance explained (PVE):
pve <- pca_out_scaled$sdev ** 2 / sum(pca_out$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
pve_gg <- pve_df %>%
ggplot(aes(principal_component, pve)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(pve)) +
labs(x = "Principal Component", y = "Percentage of Variance Explained (%)") +
theme_bw()
pve_gg
Some people prefer to plot the cumulative percentage of explained variance:
summ_pca <- summary(pca_out_scaled)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
ggplot(aes(principal_component, cum_pct)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(cum_pct)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
theme_bw()
cum_pct_gg
Thus, we conclude that the first 4 PC are significant. Let us express each cell as a vector of the first 4 PC:
toy_reduced <- pca_out_scaled$x[, c("PC1", "PC2", "PC3", "PC4")]
gene_loads_reduced <- pca_out_scaled$rotation[, c("PC1", "PC2", "PC3", "PC4")]
pheatmap(toy_reduced, cluster_rows = FALSE, cluster_cols = FALSE, angle_col = 45)
Now we have reduced the dimensionality of the dataset. To interpret each PC, let us inspect the gene loadings:
significant_pcs <- c("PC1", "PC2", "PC3", "PC4")
loadings_gg <- map(significant_pcs, function(x) {
loadings <- gene_loads_reduced[, x]
df <- data.frame(gene = names(loadings), score = loadings)
p <- df %>%
ggplot(aes(fct_reorder(gene, score), score)) +
geom_point() +
labs(x = "", y = x) +
theme_bw() +
coord_flip()
return(p)
})
loadings_gg
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
Interpretation of the PC:
Clustering means classifying observations into groups that minimize and maximize intra-group and inter-group distances, respectively.
To that end we need a concept of distance, which basically measures how similar or different two vectors are. In our case we will use Euclidean distance, but be aware that there are a plethora of different options that can yield different clustering results.
Let us start by computing all pairwise distances:
dist_mat <- dist(toy_reduced, method = "euclidean")
pheatmap(dist_mat, cluster_rows = FALSE, cluster_cols = FALSE)
hclust_average <- hclust(dist_mat, method = "average")
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = ""
)
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = ""
)
abline(h = 2.5, col = "red")
hc_clusters <- cutree(hclust_average, h = 2.5)
hc_clusters
## cell 1 cell 2 cell 3 cell 4 cell 5 cell 6 cell 7 cell 8 cell 9 cell 10
## 1 1 1 1 2 2 2 3 3 3
## cell 11 cell 12 cell 13 cell 14
## 4 4 5 5
table(hc_clusters)
## hc_clusters
## 1 2 3 4 5
## 4 3 3 2 2
annotation_rows <- data.frame(cluster = as.character(hc_clusters))
rownames(annotation_rows) <- names(hc_clusters)
pheatmap(
toy_centered,
cluster_rows = FALSE,
cluster_cols = FALSE,
angle_col = 315,
annotation_row = annotation_rows
)
| Cluster | Cell type |
|---|---|
| 1 | T-cells |
| 2 | Monocytes |
| 3 | Naive B-cells |
| 4 | Plasma Cells |
| 5 | poor-quality cells |
annotated_clusters <- c("T-cells", "Monocytes", "Naive B-cells", "Plasma Cells", "poor-quality cells")
names(annotated_clusters) <- as.character(1:5)
annotation_rows$annotation <- annotated_clusters[annotation_rows$cluster]
pheatmap(
toy_centered,
cluster_rows = FALSE,
cluster_cols = FALSE,
angle_col = 315,
annotation_row = annotation_rows
)
sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
## [5] readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2
## [9] tidyverse_1.3.0 pheatmap_1.0.12 BiocStyle_2.16.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.18 haven_2.3.1
## [4] colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2
## [7] htmltools_0.5.0 yaml_2.2.1 blob_1.2.1
## [10] rlang_0.4.7 pillar_1.4.6 withr_2.3.0
## [13] glue_1.4.2 DBI_1.1.0 RColorBrewer_1.1-2
## [16] dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
## [19] lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
## [22] cellranger_1.1.0 rvest_0.3.6 evaluate_0.14
## [25] labeling_0.3 knitr_1.30 fansi_0.4.1
## [28] broom_0.7.1 Rcpp_1.0.5 scales_1.1.1
## [31] backports_1.1.10 BiocManager_1.30.10 magick_2.4.0
## [34] jsonlite_1.7.1 farver_2.0.3 fs_1.5.0
## [37] hms_0.5.3 digest_0.6.25 stringi_1.5.3
## [40] bookdown_0.20 grid_4.0.1 cli_2.0.2
## [43] tools_4.0.1 magrittr_1.5 crayon_1.3.4
## [46] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2
## [49] reprex_0.3.0 lubridate_1.7.9 rstudioapi_0.11
## [52] assertthat_0.2.1 rmarkdown_2.4 httr_1.4.2
## [55] R6_2.4.1 compiler_4.0.1